Dependence of folding rates on protein length 
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Using three-dimensional Go lattice models with side chains for proteins, we investigate the depen- 
dence of folding times on protein length. In agreement with previous theoretical predictions, we find 
that the folding time tf grows as a power law with the chain length, i.e., tf ~ N x , where A ~ 3.6 
for the Go model, in which all native interactions (i.e., between all side chains and backbone atoms) 
are uniform. If the interactions between side chains are given by pairwise statistical potentials, 
which introduce heterogeneity in the contact energies, then the power law fits yield large A values 
that typically signifies a crossover to an underlying activated process. Accordingly, the dependence 
of tf on N is best described using tf ~ e^ . The study also shows that the incorporation of side 
chains considerably slows down folding by introducing energetic and topological frustration. 

PACS numbers: 75.40.Gb, 74.72.-h 



I. INTRODUCTION 

Protein folding mechanisms depend not only on the 
architecture of the native state, but on the external con- 
ditions (pH, salt concentration, temperature, and molec- 
ular environment). Several recent studies have argued 
that the folding rates (presumably, under the conditions 
of neutral pH, zero salt concentration, room tempera- 
ture, and the absence of molecular crowding) is deter- 
mined solely by the architecture of the native structure 
[1] . Although the native topology does constrain the en- 
semble of transition states (the folding nuclei have to be 
topology preserving [2]), other factors, such as protein 
size and the native state stability, also play a role in de- 
termining folding rates and mechanisms. For instance, a 
direct correlation between folding rates and stability has 
been noted by Clarke and coworkers [3]. They showed 
that for five proteins, all with immunoglobulin-like fold, 
the folding rates kp correlate well with the native state 
stability. On the other hand, there is a poor correla- 
tion between kp and the relative contact order [1], which 
quantifies the balance of local vs non-local native inter- 
actions. Improved correlation may still be expected for 
the proteins with a-helical or a/ (3 architecture [4]. 

Although the importance of native state stability in 
determining kp has been demonstrated, limited experi- 
mental data have been used to argue that the length of 
proteins (i.e., the number of amino acids N) should not 
affect Uf [1]- From the polymer physics perspective, this 
is somewhat surprising, because the relaxation rates even 
for ideal polymer chains depend on N . For example, the 
largest relaxation time in a Rouse chain scales as N 2 [5] . 
Because the size range of single domain proteins is limited 
(typically less than about 200 residues), the dependence 
of kp on N cannot be sharply demonstrated. In pro- 
teins other factors, such as amino acid sequence and the 
nature of local and non-local interactions in the native 



state, could be more dominant. Nevertheless, the mere 
fact that proteins are polymers implies that N should 
play some role in determining the folding rates [6]. 

The dependence of kp on N has been investigated in 
a number of theoretical studies [7-13]. Several folding 
scenarios emerge depending on the characteristic folding 
temperatures, namely, the collapse temperature, Tg, the 
folding transition temperature, Tp, and the glass tran- 
sition temperature, T g [7,14]. For optimal folding, for 
which Tg Tp, Thirumalai has predicted that the fold- 
ing time Tp should scale with N as [7] 

Tp ~ N\ (1) 

The dimensionality dependent exponent A for two-state 
folders is expected to be between 3.8 and 4.2 [7]. Sim- 
ulation studies using Go lattice models (LMs) without 
side chains suggest a smaller value of about 3 [11,12]. 
These numerical studies are in broad agreement with 
the theoretical predictions. The heteropolymer nature 
of protein-like models could make A temperature depen- 
dent. For two-state optimized folders there is a relatively 
broad range of temperatures, where Tp remains relatively 
insensitive to T [15]. The N dependence of Tp outside 
this range may not obey Eq. (1) or A may be different. 

All of the numerical studies mentioned above have been 
done using LMs, in which each residue is represented by 
a bead confined to the vertices of an appropriate (usu- 
ally cubic) lattice. Side chain packing effects, which are 
crucial in the folding process, cannot be considered in 
this class of LMs. A simple way to include these in the 
context of LMs is to attach additional bead to each a- 
carbon atom in a sequence [16,17]. Thus, an amino acid 
consists now of two beads, one representing a backbone 
(BB) and the other - a side chain (SC). In this polypep- 
tide model there are 2N beads. If an appropriate hetero- 
geneous potential between side chains is included, then 
the cooperative transition reminiscent of folding can be 
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reproduced [18]. Thermodynamics and kinetics of lat- 
tice models with side chains (LMSC) have been recently 
reported [17,19,20]. 

In this paper we examine the effect of rotamer de- 
grees of freedom on the exponent A (Eq. (1)) using 
Go-like models [21]. In these highly simplified mod- 
els, which have received considerable attention in re- 
cent years, only interactions present in the native state 
are considered. Non-native interactions, which can play 
an important role in the thermodynamics and kinetics 
of folding [19,20,22], are ignored. Nevertheless, several 
studies have showed that the Go models provide a rea- 
sonable caricature of certain aspects of folding [19,23,24]. 
For this model system, we find that the exponent A is al- 
tered by rotamer degrees of freedom. More importantly, 
A depends on the details of interactions and, in this sense, 
is non-universal. As a technical byproduct of our inves- 
tigation we show that robust results for A are obtained 
only for those Monte Carlo move sets, which are effec- 
tively ergodic. 

II. METHODS 

Model: In the LMSC the energy of a conformation is 
[17] 

N N 

E = e bb \ 8 r bb a + e bs \ 5 r bs + 

»=1,J>»+1 i=l,jjti 
N 

£ss ^ 6rfj,a i (2) 
i— 1,J>« 

where e bbl e bs and e ss arc BB-BB, BB-SC and SC-SC 
contact energies. r^,r|? and rf? are the distances be- 
tween the i th and j th residues for the BB-BB, BB-SC 
and SC-SC pairs, respectively. Each lattice site can only 
be occupied by a single bead (BB or SC) so that the 
self-avoidance condition is satisfied. 

We consider two versions of the Go model. In the 
model GM1, Cbb,£bs and e ss are chosen to be -1 for 
native contacts and for non-native ones. In GM2, 
e bb = e bs = —0.2 and the values of e ss , which depend 
on the nature of amino acids, are given by Betancourt- 
Thirumalai statistical interaction potentials [25]. Thus, 
GM2 incorporates diversity in the interaction energies 
that is known to be important in the design of foldablc 
sequences. The fraction of hydrophobic residues in a se- 
quence is approximately 0.5 as in wild-type proteins. By 
setting all of non-native contact interactions to 3.0 (this 
value is larger than any of Betancourt-Thirumalai cou- 
plings [25]), we ensure that non-native interactions do 
not contribute to folding. Thus, for all practical pur- 
poses both models exhibit Go-like characteristics. 

The maximum length of sequences N examined in our 
work is 40. Investigation of scaling behavior of LMSC 



beyond this limit is computationally expensive. However, 
scaling trends may be reasonably established for LMs 
without side chains even for N < 40 (see Fig. (2) in 
[11]). Therefore, we are fairly confident that considering 
LMSC with N < 40 would not hamper our ability to 
analyze scaling of folding times with N. 
Sequences: Protein- like sequences were obtained using 
the standard Z-score optimization or by minimizing the 
energy of the native state [26,27]. Z-score optimization 
is based on Monte Carlo simulations in sequence space 
aimed at minimizing Z = (Eq — E ms )/5, where Eq and 
E ms are the energy of a sequence in the target confor- 
mation and the average energy of misfolded structures, 
respectively, and S is the dispersion in the native con- 
tact energies. We took E ms — c < B >, where c is 
average number of nearest neighbor contacts in the man- 
ifold of misfolded structures, and < B > is the average 
contact energy for a given sequence. The Monte Carlo 
simulations in sequence space were done using simulated 
annealing protocol by generating 20 independent trajec- 
tories for each sequence. The optimized sequence is the 
one with the lowest Z-score. For each N, the target 
conformations for GM1 and GM2 are identical, so the 
effect of the "realistic" SC interactions can be directly 
addressed. 

Despite its simplicity Z-score and energy optimizations 
[26,27] have been proven to be effective techniques for 
generating designed foldable sequence with N < 100 [11]. 
Other, more elaborate, technical methods for designing 
lattice protein-like sequences have been proposed [28]. 
For the scope of our paper these methods are not rele- 
vant, because we are only interested in generating fold- 
able sequences spanning a reasonable range of N. 
Move sets in Monte Carlo simulations: To assess 
the efficiency of the Monte Carlo (MC) simulations and 
to check the robustness of the results we used four dis- 
tinct move sets (Fig. (1)). Move set MSI involves only 
single corner (and also tail) monomer moves. In addition 
to single flips, the standard move set (SMS) also contains 
the crankshaft motion [29]. Fig. (1) shows the moves in 
the set MS2, which includes SMS and additional two- 
monomer moves (not crankshaft ones). Move set MS3 is 
implemented as described in [28]. The validity of MS3 
has been verified for short sequences without side chains 
by comparing MC results with those obtained by full enu- 
meration of lattice conformations, which is tantamount 
to performing ensemble average. Thus, for MS3 ergodic- 
ity has been established and implementation of detailed 
balance condition has been also discussed [28]. We have 
found that due to its flexibility MS3 is far more efficient 
than others. The purpose of using different MC move 
sets is to ensure the robustness of our results. In our 
study one MC step consists of N MC moves, i.e., on an 
average each bead in a sequence is attempted to move 
once during one MC step. 

Computation of folding times and temperatures: 
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For each sequence and temperature we computed the dis- 
tribution of first passage times tu, where Tu is the num- 
ber of MC steps needed to reach the native state starting 
from the unfolded state i. The structure is considered 
folded, if the overlap function \ = [17]. The folding 
time Tp = jj J2iti T H' wnere M = 100 is the number 
of individual trajectories. The folding time to reach the 
native backbone, T F b , has been calculated in a similar 
way. 

In our study, folding times Tp have been obtained at 
the temperature of fastest folding T m j„ located by scan- 
ning a temperature range for each sequence. If the folding 
transition temperature Tp is identified with the maxi- 
mum in the fluctuations of overlap function [17], then we 
expect Tp ss T min . This conclusion, which naturally fol- 
lows from the dual requirement of the stability of protein 
native state and its kinetic accessibility, is illustrated in 
Fig. (2) for two-state LMSC sequence with N = 15. The 
observation that Tp w T m „ serves as a convenient oper- 
ational condition to pinpoint the temperature of folding 
simulations without the need of expensive equilibrium 
simulations. Furthermore, it has been shown [15] that 
for highly optimized sequences there is a large plateau 
in the temperature dependence of Tp. This flat tem- 
perature range typically includes T m i n , Tp, and collapse 
temperature Tg. We expect that scaling behavior would 
be similar for these temperatures. 

In a previous study [12] that has examined the N de- 
pendence of folding times, Tp has been defined as the 
temperature, at which the probability of occupancy of 
the microscopic native conformation is 0.5. Such a highly 
restrictive definition of Tp is not physically meaningful. 
It is realized that the fluctuations in the overlap function 
or the equilibrium fraction of native contacts are more 
appropriate quantities for defining Tp [17,20,30]. The 
physically relevant definition, which also coincides with 
the experimental ones, is based on the notion of the na- 
tive state as a collection of structurally similar conforma- 
tions belonging to the native basin of attraction (NBA). 

III. RESULTS 

We have monitored the length dependence of the fold- 
ing times Tp (which register folding of the entire native 
structure) as well as T F b , which is the average first passage 
time to the folded backbone. The characteristic U-shapc 
for the temperature dependence of Tp noted for optimized 
sequences [11] is also found for T F b (Figs. (3)). Over the 
range of temperatures, where Tp remains roughly con- 
stant, Tp i=a T h F h . However, this is not the case at low 
and high temperatures (see below). The extent of the 
plateau region in T is larger for GM1 (Fig. (3)). Narrow 
shape of the temperature dependence of Tp for GM2 re- 
sembles that computed for the random sequence without 



side chains (Fig. (3) in [11]). In the rest of the paper we 
focus on the variations of the folding times at T m j„ w Tp. 

The A-dependence of folding times obtained by four 
types of MC move sets for GM1 at T = T m i n is presented 
in Fig. (4). The number of targets used for N = 9, 15, 
18, 24, 28, 32, and 40 arc 100, 50, 50, 20, 17, 15, and 15, 
respectively. For MSI the calculations were performed 
only up to N = 32. Ergodic move sets (i.e., SMS, MS2, 
and MS3), which efficiently sample the conformational 
space, were used for TV = 40. 

It is well known that MSI based on single monomer 
corner and tail flips is not ergodic [29]. Consequently, 
the folding rates obtained using MSI are not reliable. For 
GM1 we computed A = 3.7 ± 0.3, 3.6 ± 0.2 and 3.6 ± 0.2 
for MS2, MS3 and SMS, respectively (Fig. (4)). The 
power law also holds for folding of the backbone with 
A;,;, ~ A at the simulation temperatures T m i n . Interest- 
ingly, exponents A and Abb obtained by the three ergodic 
move sets are almost the same, which establishes the ro- 
bustness of our results. Because the scaling exponent A 
for Go LMSC models is higher than for those without 
SCs [11,12], we assume that dense packing of side chain 
creates additional barriers to folding. We also note that 
A for GM1 sequences is in the range proposed by Thiru- 
malai for fast folding sequences [7]. 

Side chains alter A values for GM2 with realistic in- 
teractions. It is still possible to fit the folding times 
obtained for GM2 using a power law with the large ex- 
ponent A w 6.5 ± 0.4, which is similar to that reported 
for random sequences without side chains [11]. Because 
the same target conformations for both GM1 and GM2 
simulations have been used, the large difference in ex- 
ponents is due to the diversity of interactions. Thus, 
our study shows that, even though GM2 sequences ex- 
hibit two-state folding, heterogeneity in the interactions 
between side chains (as in GM2) adds roughness to the 
underlying energy landscape. Furthermore, side chain 
packing also introduces enhanced topological frustration 
compared to LM without side chains. As a consequence 
scaling behavior of those sequences resembles that of ran- 
dom sequences without side chains [11]. In fact, for ran- 
dom sequences without side chains Gutin et al had not 
ruled out the possibility of exponential scaling [11]. 

From the physical viewpoint large A values are indica- 
tive of an activated process with a free energy barrier 
scaling slower than N [10]. In particular, using the pro- 
posal that the activation barrier scales as y/N [7] , we find 
that tf for GM2 can be fit by t f ~ e pVN (see Fig. (5)). 
Recent analysis of experimental data also suggests that 
the barrier height scales as N a with a = 0.607 ± 0.179 
[6], which is consistent with \/N [7] or A 2 / 3 [10] scaling. 

Although the scaling exponents are the same for all 
three ergodic move sets, the folding times vary. The de- 
pendence of t| M5 /t# /S2 and t§ ms /t^ IS3 on N for GM1 
shows that the folding times obtained by the standard 
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move set SMS [20] are about twice as long as those for 
MS2 and MS3. Because r# MS /r^ S3 > t£ ms /t^ S2 , we 
conclude that the MS3 dynamics is the most efficient for 
folding in LMs. This is not unexpected, because MS3 
incorporates flexible choice of multimcric moves, which 
efficiently sample local conformational space. 

Both GM1 and GM2 show that there are no signif- 
icant differences in the time scales for backbone and 
side chain folding in the plateau temperature range, i.e. 
Tp ' /tf ~ 0(1) (Fig. (3)). However, outside this temper- 
ature range r|? starts to deviate significantly from Tp . Of 
particular interest are the temperatures T < Tp, where 
NBA is populated. This shows that at low temperatures 
(T/Tp is relatively small) the backbone ordering occurs 
considerably faster than the folding of side chains. The 
rate determining step is associated side chain ordering, 
which might involve transitions over barriers of varying 
heights. Thus, there may be a relatively narrow temper- 
ature window (for example, 0.9 < T/Tp < 1.2 in Fig. 
(2)), in which T F h w t f . 

IV. CONCLUSIONS 

We have studied the scaling properties of Go lattice 
sequences with SCs using four different types of Monte 
Carlo moves. The exponents in the power laws describ- 
ing the scaling of folding times with sequence length N 
are sensitive to the ergodicity of the move sets and in- 
teraction details [11]. The move set MS3, which is based 
on flexible selection of multimeric moves, is found to be 
the most efficient for studying folding in LMs. Strong de- 
pendence of folding times for LMSC on sequence length 
is attributed to side chain packing. The presence of side 
chains interacting via diverse potentials gives rise to in- 
trinsic roughness in the underlying energy landscape. 
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FIG. 1. MC move sets examined in this study: (a) MSI is 
based on single monomer corner flips (including tail ones); (b) 
SMS incorporates MSI and the crankshaft moves; (c) MS2 
involves SMS and additional two-monomer moves; (d) MS3 
contains MS2 and three- monomer moves [28]. 



FIG. 3. The temperature dependence of tf and t f for 
N = 18 GM1 and GM2 sequences, both of which share the 
same native conformation. Data are obtained using MS3. The 
arrows indicate T m i n , the temperature at which tf is mini- 
mal. The GM2 sequences have a narrower plateau region than 
GM1. Therefore, GM2 sequences are not as well optimized as 
GM1. 
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FIG. 2. The characteristic U-shape dependence of the fold- 
ing time tf on temperature measured in the units of the fold- 
ing temperature Tf for the LMSC sequence A (N — 15) stud- 
ied in [17]. Tf is computed using multiple histogram method 
as the temperature, at which the fluctuations in the overlap 
function reach maximum. The temperature, at which tf is 
minimum, is T miTl — 1.072>. An observation that T F m T m in 
may be used for crude estimation of Tf for a large dataset of 
sequences. Almost flat region in the vicinity of T F is indicated 
by a dotted box [15] (see also Fig. (3)). 
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FIG. 4. Scaling of t f and rf for GM1 at T = T min . The 
results arc shown for all four types of MC move sets. The solid 
symbols indicate folding times for the entire sequence tf, the 
open ones represent backbone folding times T F b . Straight solid 
and dotted lines are corresponding power law fits. Scalings 
of tf and r F b are virtually identical, i..e, A fa \bb- The scal- 
ing exponents for MSI move set substantially differ from the 
others that reflects its inherent lack of ergodicity. The results 
are averaged over 100, 50, 50, 20, 17, 15 and 15 target con- 
formations for N=9, 15, 18, 24, 28 , 32 and 40, respectively. 
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FIG. 5. Scaling of tf with N for GM2 sequences computed 
at T = T m in using MS3. The solid line represents the expo- 
nential fit e 13 ^ with (3 — 3.1 ±0.5. Exponential fit provides 
a physically sound interpretation for such scaling behavior 
based on barrier crossing. 
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